implement file splitting functionality and enhance documentation #51
Merged
whhe merged 27 commits into oceanbase:main on Feb 25, 2026
…nitial SDK configuration
…ackage configuration
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…f 'markdown' for parsed results
|
Documentation Updates
2 document(s) were updated by changes in this PR:
Markdown Processing and Chunking
@@ -1,12 +1,21 @@
-PowerRAG provides a robust system for processing markdown documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
+PowerRAG provides a robust system for processing documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
### Architecture and Chunking Strategies
-The core service for markdown chunking is `PowerRAGSplitService`, which exposes a unified interface for splitting text using different strategies, selected via the `parser_id` parameter. Supported strategies include:
+The core service is `PowerRAGSplitService`, which exposes two primary capabilities:
-- **Title-based chunking**: Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
-- **Regex-based chunking**: Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
-- **Smart chunking**: Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+1. **Text Splitting** (`split_text`): For markdown/text content using three specialized parsers
+2. **File Splitting** (`split_file`, `split_file_upload`): For files using all available ParserType methods
+
+#### Text Splitting
+
+The `split_text` method supports three specialized parsers for markdown and text content, selected via the `parser_id` parameter:
+
+- **Title-based chunking** (`title`): Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
+- **Regex-based chunking** (`regex`): Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
+- **Smart chunking** (`smart`): Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+
+**Note**: Only these three parsers are supported for `split_text`. For other parsers (such as `naive`, `book`, `qa`, `paper`, etc.), use the file splitting methods.
Example usage:
```python
@@ -31,6 +40,71 @@
Smart chunking parses the markdown document into an abstract syntax tree (AST) using `MarkdownIt`. It recursively processes AST nodes, treating headings as chunk boundaries and preserving containers such as lists, tables, and code blocks. Chunks are merged or split based on token counts and document structure. Large chunks are split first by headings, then by newlines, ensuring each chunk is close to the target token size and titles are preserved as prefixes.
+#### File Splitting
+
+The `split_file` and `split_file_upload` methods support all available ParserType methods, providing comprehensive file chunking capabilities. These methods work with local files, file URLs, and file uploads, supporting various document types including PDFs, Office documents, images, and HTML.
+
+**Supported ParserType Methods**:
+- **Basic parsers**: `naive`, `title`, `regex`, `smart`
+- **Specialized parsers**: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- **Format-specific parsers**: `table`, `resume`, `picture`, `one`, `email`
+
+The file splitting methods internally initialize a file chunker factory (`_init_file_chunker_factory`) that maps each ParserType to its corresponding chunking module from the `rag/app` and `powerrag/app` packages.
+
+**Usage Examples**:
+
+Using a local file path:
+```python
+service = PowerRAGSplitService()
+result = service.split_file(
+ filename="/path/to/document.pdf",
+ parser_id="book",
+ config={"chunk_token_num": 512, "delimiter": "\n。.;;!!??"}
+)
+```
+
+Using a file URL:
+```python
+result = service.split_file(
+ filename="https://example.com/doc.pdf",
+ binary=None, # Binary will be downloaded
+ parser_id="naive",
+ config={
+ "chunk_token_num": 256,
+ "max_file_size": 128 * 1024 * 1024, # 128MB
+ "download_timeout": 300, # 5 minutes
+ "head_request_timeout": 30 # 30 seconds
+ }
+)
+```
+
+Using file upload (via API):
+```python
+# Read file binary
+with open("document.pdf", "rb") as f:
+ binary = f.read()
+
+result = service.split_file(
+ filename="document.pdf",
+ binary=binary,
+ parser_id="book",
+ config={"chunk_token_num": 512}
+)
+```
+
+**Configuration Parameters for File Splitting**:
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks (default: `"\n。.;;!!??"`)
+- `lang`: Language for processing (default: `"Chinese"`)
+- `from_page`: Starting page number for PDF processing (default: 0)
+- `to_page`: Ending page number for PDF processing (default: 100000)
+- `max_file_size`: Maximum file size for URL downloads in bytes (file URL only)
+- `download_timeout`: Download timeout in seconds for file URLs (file URL only)
+- `head_request_timeout`: HEAD request timeout in seconds for file URLs (file URL only)
+
+The file splitting methods return chunks as a list of strings, along with metadata including the parser ID, total chunk count, and filename.
+[Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
### Handling Markdown Elements
Markdown elements, especially images, are carefully preserved during chunking. In smart chunking, image nodes are reconstructed using their `alt` and `src` attributes to produce the correct markdown syntax (`![alt](src)`). This ensures that image source links are not lost during chunking, addressing previous bugs where image sources were dropped in smart chunks [PR #11](https://github.com/oceanbase/powerrag/pull/11).
@@ -47,7 +121,9 @@
### Customizing and Extending Chunking Behavior
-Chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
+#### Text Splitting Configuration
+
+For text splitting with `split_text`, chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
- `title_level`: Markdown header level for splitting (title-based chunking).
- `chunk_token_num`: Target chunk size in tokens.
@@ -63,14 +139,37 @@
config={"chunk_token_num": 256, "min_chunk_tokens": 64}
)
```
+
+#### File Splitting Configuration
+
+For file splitting with `split_file` or `split_file_upload`, all ParserType methods are available. Configuration parameters vary by parser but commonly include:
+
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks
+- `lang`: Language for processing
+- `from_page`, `to_page`: Page range for PDF processing
+- `max_file_size`, `download_timeout`, `head_request_timeout`: URL download settings
+
+Example using the `book` parser:
+```python
+result = service.split_file(
+ filename="/path/to/book.pdf",
+ parser_id="book",
+ config={"chunk_token_num": 512, "lang": "Chinese"}
+)
+```
[Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
+### Extending Chunking Logic
+
+To extend text chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
+
+To add support for new file parsers, implement a chunking module following the pattern of existing modules in `rag/app` or `powerrag/app`, and register it in the `_file_chunker_factory` mapping during initialization.
+
+### Binary File Parsing
The system also supports parsing binary files (PDF, Office documents, images, HTML) into markdown, returning the markdown content, images (as base64), and metadata. The parsing configuration can specify layout recognition engines, formula and table recognition, and page ranges for PDFs [PR #40](https://github.com/oceanbase/powerrag/pull/40).
-### Extending Chunking Logic
-
-To extend chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
-
---
-For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant [pull requests](https://github.com/oceanbase/powerrag/pull/11).
+For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant pull requests: [PR #11](https://github.com/oceanbase/powerrag/pull/11), [PR #40](https://github.com/oceanbase/powerrag/pull/40), [PR #51](https://github.com/oceanbase/powerrag/pull/51).
PowerRAG SDK
@@ -9,6 +9,7 @@
- Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
- Asynchronous and synchronous document parsing, with status polling and cancellation.
- Manage document metadata, download content, and handle document chunks.
+- Split text and files into chunks using various parser methods, including support for local files, file URLs, and file uploads.
### Knowledge Base Management
- Create, update, list, and delete chat sessions and agents.
@@ -104,6 +105,20 @@
- `use_kg`: Use knowledge graph (bool)
- `toc_enhance`: Table of contents enhancement (bool)
+### File Splitting
+- `parser_id`: Parser method ID (str)
+ - For text splitting (`split_text`): Only supports `title`, `regex`, `smart`
+ - For file splitting (`split_file`, `split_file_upload`): Supports all ParserType methods including `naive`, `title`, `regex`, `smart`, `qa`, `book`, `laws`, `paper`, `manual`, `presentation`, `table`, `resume`, `picture`, `one`, `email`
+- `chunk_token_num`: Target chunk size in tokens (int, default 512)
+- `delimiter`: Delimiter string (str, default `"\n。.;;!!??"`)
+- `lang`: Language (str, default `"Chinese"`)
+- `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `file_path`: Local file path (str, optional, for `split_file`)
+- `file_url`: Remote file URL (str, optional, for `split_file`)
+- `max_file_size`: Maximum file size in bytes for URL downloads (int, optional, default 128MB)
+- `download_timeout`: Download timeout in seconds (int, optional, default 300)
+- `head_request_timeout`: HEAD request timeout in seconds (int, optional, default 30)
+
## Usage Examples
### Create a Dataset and Upload Documents
@@ -189,12 +204,82 @@
print(chunk.content)
```
+### Split Text into Chunks
+```python
+# Text splitting only supports: title, regex, smart
+result = client.chunk.split_text(
+ text="# Chapter 1\n\nThis is the content of chapter 1.\n\n# Chapter 2\n\nThis is chapter 2.",
+ parser_id="title", # Only: title, regex, or smart
+ config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+for chunk in result['chunks']:
+ print(chunk) # chunks are strings
+```
+
+### Split Files into Chunks
+```python
+# Method 1: Split file from local path (server must have access to the path)
+result = client.chunk.split_file(
+ file_path="/path/to/document.pdf",
+ parser_id="book", # Supports all ParserType methods
+ config={
+ "chunk_token_num": 512,
+ "delimiter": "\n。.;;!!??",
+ "lang": "Chinese",
+ "from_page": 0,
+ "to_page": 100
+ }
+)
+
+# Method 2: Split file from URL
+result = client.chunk.split_file(
+ file_url="https://example.com/document.pdf",
+ parser_id="naive",
+ config={
+ "chunk_token_num": 256,
+ "max_file_size": 128 * 1024 * 1024, # 128MB
+ "download_timeout": 300,
+ "head_request_timeout": 30
+ }
+)
+
+# Method 3: Upload file and split
+result = client.chunk.split_file_upload(
+ file_path="/path/to/local/document.pdf",
+ parser_id="book",
+ config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+print(f"Filename: {result['filename']}")
+print(f"Parser used: {result['parser_id']}")
+for chunk in result['chunks']:
+ print(chunk) # chunks are strings
+```
+
+**Supported ParserType methods for file splitting:**
+- Basic: `naive`, `title`, `regex`, `smart`
+- Professional: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- Special formats: `table`, `resume`, `picture`, `one`, `email`
+
+**Return value structure:**
+```python
+{
+ "parser_id": "book",
+ "chunks": ["chunk1", "chunk2", ...], # List of strings
+ "total_chunks": 10,
+ "filename": "document.pdf"
+}
+```
+
## Integration Guidelines
1. Install the SDK via pip.
2. Import `PowerRAGClient` from `powerrag.sdk`.
3. Initialize the client with your API key and server URL.
-4. Use resource objects (`dataset`, `document`, `chat`, `agent`) and their methods for all operations.
-5. Configure advanced options as needed for parsing, retrieval, and chat/agent creation.
+4. Use resource objects (`dataset`, `document`, `chat`, `agent`, `chunk`) and their methods for all operations.
+5. Configure advanced options as needed for parsing, retrieval, file splitting, and chat/agent creation.
6. Handle exceptions as raised by SDK methods for error management.
7. Refer to type annotations and docstrings for IDE assistance.
@@ -208,3 +293,69 @@
## License
PowerRAG SDK is licensed under Apache-2.0 ([source](https://github.com/oceanbase/powerrag/pull/27)).
+
+## Frequently Asked Questions
+
+### What is the difference between text splitting and file splitting methods?
+
+The SDK provides three different methods for chunking content:
+
+**`split_text`**: Text-only splitting
+- Only supports three parser methods: `title`, `regex`, `smart`
+- Designed for plain text or Markdown content
+- No file handling required
+
+**`split_file`**: File splitting via path or URL
+- Supports all ParserType methods (15+ parsers)
+- Can process files from local paths (`file_path`) or remote URLs (`file_url`)
+- Server must have access to local paths when using `file_path`
+
+**`split_file_upload`**: Upload and split
+- Supports all ParserType methods (15+ parsers)
+- Uploads file from local system to server before splitting
+- Best for local files when server doesn't have direct access
+
+**When to use each method:**
+
+Use `split_text` when:
+- You have plain text or Markdown content
+- You only need `title`, `regex`, or `smart` parsers
+- You don't have a file to process
+
+Use `split_file` when:
+- You need parsers other than `title`, `regex`, or `smart` (e.g., `book`, `qa`, `naive`, `paper`)
+- The file is accessible via a URL
+- The file is on the server's filesystem (accessible via `file_path`)
+
+Use `split_file_upload` when:
+- You need parsers other than `title`, `regex`, or `smart`
+- The file is on your local machine
+- The server doesn't have direct access to the file path
+
+**Examples:**
+
+```python
+# Text splitting (only title, regex, smart)
+result = client.chunk.split_text(
+ text="# Chapter 1\n\nContent...",
+ parser_id="title"
+)
+
+# File splitting from local path
+result = client.chunk.split_file(
+ file_path="/server/path/doc.pdf",
+ parser_id="book" # Can use any parser
+)
+
+# File splitting from URL
+result = client.chunk.split_file(
+ file_url="https://example.com/doc.pdf",
+ parser_id="naive"
+)
+
+# Upload and split
+result = client.chunk.split_file_upload(
+ file_path="/local/path/doc.pdf",
+ parser_id="qa"
+)
+``` |
Contributor
Pull request overview
This PR implements file splitting functionality for PowerRAG, enabling users to chunk documents via local paths, URLs, and uploads using various parser types. The changes enhance the existing text-only splitting capabilities by adding support for file-based parsing methods from the rag/app module.
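As a rough illustration of the URL path mentioned above, the sketch below shows one way a service could tell a remote file from a local path and reject an oversized download before fetching it. The helper names `is_remote` and `check_declared_size` and the plain-dict header input are assumptions made for this example and are not taken from the PR; only the 128MB default comes from the documentation.

```python
from urllib.parse import urlparse

# 128MB, the documented default for max_file_size
DEFAULT_MAX_FILE_SIZE = 128 * 1024 * 1024


def is_remote(filename: str) -> bool:
    """Return True when the filename looks like an http(s) URL (hypothetical helper)."""
    return urlparse(filename).scheme in ("http", "https")


def check_declared_size(headers: dict, max_file_size: int = DEFAULT_MAX_FILE_SIZE) -> None:
    """Reject a download early if a HEAD response declares too large a body.

    `headers` is assumed to be a plain dict of string header values.
    A missing or non-numeric Content-Length passes this check, so a real
    downloader would still need to enforce the limit while streaming.
    """
    declared = headers.get("Content-Length")
    if declared is not None and declared.isdigit() and int(declared) > max_file_size:
        raise ValueError(
            f"File size {declared} bytes exceeds max_file_size {max_file_size} bytes"
        )
```

Checking the declared size first is only an optimization; servers may omit or misreport `Content-Length`, which is why the sketch does not treat a passing check as proof that the body is small.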
Changes:
- Added `split_file` and `split_file_upload` methods to support file chunking through local paths, URLs, and direct uploads
- Enhanced error handling with more descriptive messages for unsupported parsers in text splitting
- Updated documentation with comprehensive examples distinguishing between text and file splitting methods
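The error-handling change listed above can be pictured with a minimal sketch like the following. The function name `validate_text_parser` and the exact message wording are assumptions for illustration; only the three supported parser IDs and the advice to use the file splitting methods come from the documentation in this PR.

```python
# Parser IDs that split_text accepts, per the updated documentation
TEXT_PARSERS = ("title", "regex", "smart")


def validate_text_parser(parser_id: str) -> str:
    """Return parser_id if it is valid for text splitting, else raise ValueError.

    Hypothetical helper: the message points callers at the file splitting
    methods, which accept the remaining ParserType values.
    """
    if parser_id not in TEXT_PARSERS:
        raise ValueError(
            f"Parser '{parser_id}' is not supported for text splitting. "
            f"Supported parsers: {', '.join(TEXT_PARSERS)}. "
            "For other parsers, use split_file or split_file_upload."
        )
    return parser_id
```

Calling `validate_text_parser("naive")` raises an error that names both the valid choices and the alternative methods, so the caller does not have to look up the documentation to recover.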
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| powerrag/server/services/split_service.py | Adds _init_file_chunker_factory and split_file method to support file-based chunking with all parser types; improves error messages for unsupported text parsers |
| powerrag/server/routes/powerrag_routes.py | Implements /split/file and /split/file/upload endpoints; changes ConnectionError status code from 503 to 400 |
| powerrag/sdk/modules/chunk_manager.py | Adds split_file and split_file_upload client methods with support for both file paths and URLs |
| powerrag/sdk/tests/test_chunk.py | Adds tests for file splitting upload functionality and unsupported parser error handling; changes test parser from "naive" to "regex" |
| powerrag/sdk/tests/test_document.py | Adds sleep delays for async operation timing in cancel_parse test |
| powerrag/sdk/README.md | Comprehensive documentation updates clarifying split_text vs split_file usage, with examples for all three file splitting methods |
| api/apps/sdk/powerrag_proxy.py | Adds proxy endpoints for split_file operations; improves file handling with BytesIO wrapper for async file reading |
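The factory-based dispatch described for `split_service.py` might look roughly like the sketch below. It is not the PR's implementation: `naive_chunker`, `build_file_chunker_factory`, and this standalone `split_file` are stand-ins written for this example, and the chunker signature is an assumption. The returned keys (`parser_id`, `chunks`, `total_chunks`, `filename`) follow the return value structure shown in the SDK documentation.

```python
from typing import Callable, Dict, List, Optional

# Assumed chunker signature: (filename, binary, config) -> list of chunk strings
Chunker = Callable[[str, bytes, dict], List[str]]


def naive_chunker(filename: str, binary: bytes, config: dict) -> List[str]:
    """Stand-in chunker: decode as UTF-8 and split on blank lines."""
    text = binary.decode("utf-8", errors="ignore")
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def build_file_chunker_factory() -> Dict[str, Chunker]:
    """Map parser IDs to chunkers; a real factory would register every ParserType."""
    return {"naive": naive_chunker}


def split_file(filename: str, binary: bytes, parser_id: str,
               config: Optional[dict] = None) -> dict:
    """Look up the chunker for parser_id and wrap its output with metadata."""
    factory = build_file_chunker_factory()
    if parser_id not in factory:
        raise ValueError(
            f"Unsupported parser '{parser_id}'. Available: {', '.join(sorted(factory))}"
        )
    chunks = factory[parser_id](filename, binary, config or {})
    return {
        "parser_id": parser_id,
        "chunks": chunks,
        "total_chunks": len(chunks),
        "filename": filename,
    }
```

Keeping the mapping in one place means adding a parser is a single registration, which matches the extension advice in the documentation diff above.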
ca05bc5 to
15d378b
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…rrag-github into powerrag_sdk_api
whhe
approved these changes
Feb 25, 2026
Summary
- `split_file` and `split_file_upload` methods to support file chunking via local paths, URLs, and uploads.
- `README.md` to include detailed examples for text and file splitting methods.

Solution Description